Introduction

In homework 2 you will fit many regression models. You are welcome to explore beyond what the question is asking you.

Please come see us; we are here to help.

Data analysis

Analysis of earnings and height data

The folder earnings has data from the Work, Family, and Well-Being Survey (Ross, 1990). You can find the codebook at http://www.stat.columbia.edu/~gelman/arm/examples/earnings/wfwcodebook.txt

library(foreign)  # provides read.dta()
gelman_dir <- "http://www.stat.columbia.edu/~gelman/arm/examples/"
heights    <- read.dta(paste0(gelman_dir, "earnings/heights.dta"))

Pull out the data on earnings, sex, height, and weight.

  1. In R, check the dataset and clean any unusually coded data.
library(tidyverse)
## ─ Attaching packages ────────────────────────────── tidyverse 1.2.1 ─
## ✔ tibble  1.4.2     ✔ purrr   0.2.5
## ✔ tidyr   0.8.1     ✔ dplyr   0.7.6
## ✔ readr   1.1.1     ✔ stringr 1.3.1
## ✔ tibble  1.4.2     ✔ forcats 0.3.0
## ─ Conflicts ─────────────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::between()   masks data.table::between()
## ✖ tidyr::expand()    masks Matrix::expand()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ dplyr::select()    masks MASS::select()
## ✖ purrr::transpose() masks data.table::transpose()
map_dbl(heights, ~ sum(is.na(.x)))
##    earn height1 height2     sex    race    hisp      ed  yearbn  height 
##     650       8       6       0       0       0       0       0       8
na<-which(!complete.cases(heights))
heights_withoutNa<-heights[-na,]
map_dbl(heights_withoutNa, ~ sum(is.na(.x)))
##    earn height1 height2     sex    race    hisp      ed  yearbn  height 
##       0       0       0       0       0       0       0       0       0
zero_earn<-which(heights_withoutNa$earn==0)
heights_clean<-heights_withoutNa[-zero_earn,]
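As a cross-check, the same cleaning can be written as a single dplyr pipeline (a sketch equivalent to the steps above; `heights_clean2` is just an illustrative name):

```r
# Drop rows with any missing value, then drop zero earnings
heights_clean2 <- heights %>%
  drop_na() %>%
  filter(earn > 0)
```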
  2. Fit a linear regression model predicting earnings from height. What transformation should you perform in order to interpret the intercept from this model as average earnings for people with average height?
# Center height at its mean so the intercept equals average earnings at average height
center_height<-heights_clean$height-mean(heights_clean$height)
mod2<-lm(earn~center_height,data=heights_clean)
summary(mod2)
## 
## Call:
## lm(formula = earn ~ center_height, data = heights_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30166 -11309  -3428   6527 172953 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    23154.8      546.4  42.376   <2e-16 ***
## center_height   1262.3      142.1   8.883   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18870 on 1190 degrees of freedom
## Multiple R-squared:  0.06218,    Adjusted R-squared:  0.06139 
## F-statistic:  78.9 on 1 and 1190 DF,  p-value: < 2.2e-16
plot(mod2)

  3. Fit some regression models with the goal of predicting earnings from some combination of sex, height, and weight. Be sure to try various transformations and interactions that might make sense. Choose your preferred model and justify.
# Log-transform earnings; zero earnings were removed above, so log(earn) is well defined
mod3<-lm(log(earn)~height+sex+ed,data=heights_clean)
summary(mod3)
## 
## Call:
## lm(formula = log(earn) ~ height + sex + ed, data = heights_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4496 -0.3379  0.1320  0.5151  2.1989 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.252336   0.673976  12.244  < 2e-16 ***
## height       0.009061   0.008875   1.021    0.307    
## sex         -0.467729   0.068733  -6.805  1.6e-11 ***
## ed           0.117962   0.010057  11.730  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8343 on 1188 degrees of freedom
## Multiple R-squared:  0.1814, Adjusted R-squared:  0.1793 
## F-statistic: 87.73 on 3 and 1188 DF,  p-value: < 2.2e-16
plot(mod3)

# z-score transformation on height
height_z<-center_height/sd(heights_clean$height)
mod33<-lm(earn~height_z+sex+ed,data=heights_clean)
summary(mod33)
## 
## Call:
## lm(formula = earn ~ height_z + sex + ed, data = heights_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40945 -10027  -2379   5763 158605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3418.4     3578.7   0.955    0.340    
## height_z       706.0      715.7   0.986    0.324    
## sex         -10083.5     1440.9  -6.998 4.32e-12 ***
## ed            2638.5      210.8  12.515  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17490 on 1188 degrees of freedom
## Multiple R-squared:  0.1953, Adjusted R-squared:  0.1933 
## F-statistic: 96.11 on 3 and 1188 DF,  p-value: < 2.2e-16
plot(mod33)

# I prefer the model with log-transformed earnings: a few very large earn values are
# strongly influential on the untransformed scale, and the log model reduces their leverage.
  4. Interpret all model coefficients.
# For the log-earnings model (mod3), exponentiating both sides gives
# earn = exp(8.25) * exp(0.009)^height * exp(0.118)^ed / exp(0.468)^sex
# So predicted earnings are exp(8.25) when height = 0, ed = 0, and sex = 0.
# Each additional inch of height multiplies earnings by exp(0.009), about 0.9% higher;
# each additional year of education multiplies earnings by exp(0.118), about 12.5% higher;
# and a one-unit increase in sex divides earnings by exp(0.468), about 37% lower.
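Exponentiating the coefficients of mod3 returns these multiplicative effects directly:

```r
# Multiplicative change in earnings for a one-unit increase in each predictor
exp(coef(mod3))
```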
  5. Construct 95% confidence intervals for all model coefficients and discuss what they mean.
round(confint(mod2,level=0.95))
##               2.5 % 97.5 %
## (Intercept)   22083  24227
## center_height   984   1541
# We are 95% confident that the interval (22083, 24227) covers the true average earnings
# at average height (the intercept of mod2), and that (984, 1541) covers the true increase
# in earnings per additional inch of height.
round(confint(mod3, level = 0.95), 2)
##                 2.5 % 97.5 %
## (Intercept)      6.93   9.57
## height          -0.01   0.03
## sex             -0.60  -0.33
## ed               0.10   0.14
# We are 95% confident that each interval covers the corresponding true coefficient of mod3;
# for example, (6.93, 9.57) covers the true intercept, i.e. log earnings when all predictors are 0.

Analysis of mortality rates and various environmental factors

The folder pollution contains mortality rates and various environmental factors from 60 U.S. metropolitan areas from McDonald, G.C. and Schwing, R.C. (1973) ‘Instabilities of regression estimates relating air pollution to mortality’, Technometrics, vol.15, 463-482.

Variables, in order:

  • PREC Average annual precipitation in inches
  • JANT Average January temperature in degrees F
  • JULT Same for July
  • OVR65 % of 1960 SMSA population aged 65 or older
  • POPN Average household size
  • EDUC Median school years completed by those over 22
  • HOUS % of housing units which are sound & with all facilities
  • DENS Population per sq. mile in urbanized areas, 1960
  • NONW % non-white population in urbanized areas, 1960
  • WWDRK % employed in white collar occupations
  • POOR % of families with income < $3000
  • HC Relative hydrocarbon pollution potential
  • NOX Same for nitric oxides
  • SO2 Same for sulphur dioxide
  • HUMID Annual average % relative humidity at 1pm
  • MORT Total age-adjusted mortality rate per 100,000

For this exercise we shall model mortality rate given nitric oxides, sulfur dioxide, and hydrocarbons as inputs. This model is an extreme oversimplification as it combines all sources of mortality and does not adjust for crucial factors such as age and smoking. We use it to illustrate log transformations in regression.

gelman_dir   <- "http://www.stat.columbia.edu/~gelman/arm/examples/"
pollution    <- read.dta (paste0(gelman_dir,"pollution/pollution.dta"))
  1. Create a scatterplot of mortality rate versus level of nitric oxides. Do you think linear regression will fit these data well? Fit the regression and evaluate a residual plot from the regression.
library(ggplot2)
ggplot(data = pollution) + geom_point(mapping = aes(x = nox, y = mort))

mod1<-lm(mort~nox,data=pollution)
summary(mod1)
## 
## Call:
## lm(formula = mort ~ nox, data = pollution)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -148.654  -43.710    1.751   41.663  172.211 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 942.7115     9.0034 104.706   <2e-16 ***
## nox          -0.1039     0.1758  -0.591    0.557    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.55 on 58 degrees of freedom
## Multiple R-squared:  0.005987,   Adjusted R-squared:  -0.01115 
## F-statistic: 0.3494 on 1 and 58 DF,  p-value: 0.5568
# The linear regression does not fit these data well: nox is heavily right-skewed,
# so a few very large values dominate the fit.
plot(resid(mod1))

plot(mod1)

  2. Find an appropriate transformation that will result in data more appropriate for linear regression. Fit a regression to the transformed data and evaluate the new residual plot.
# Use log(nox + 1), since the scatterplot shows nox is heavily right-skewed with some very large values
mod22<-lm(mort~log(nox+1),data=pollution)
summary(mod22)
## 
## Call:
## lm(formula = mort ~ log(nox + 1), data = pollution)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -165.622  -29.433    8.987   36.534  166.293 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   901.597     20.078  44.904   <2e-16 ***
## log(nox + 1)   15.661      7.474   2.096   0.0405 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.49 on 58 degrees of freedom
## Multiple R-squared:  0.07038,    Adjusted R-squared:  0.05435 
## F-statistic: 4.391 on 1 and 58 DF,  p-value: 0.0405
plot(mod22)

  3. Interpret the slope coefficient from the model you chose in 2.
# From the summary of mod22, mort = 901.6 + 15.66 * log(nox + 1)
# When nox = 0, predicted mort is 901.6. Each one-unit increase in log(nox + 1)
# is associated with an increase of about 15.66 in the mortality rate.
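On the original nox scale, the slope can be read through changes in log(nox + 1); for example, doubling (nox + 1) adds about beta * log(2) to the predicted mortality rate:

```r
# Predicted change in mort when (nox + 1) doubles, using the fitted slope
15.661 * log(2)  # roughly 10.9
```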
  4. Construct a 99% confidence interval for the slope coefficient from the model you chose in 2 and interpret it.
round(confint(mod22,level=0.99),2) 
##               0.5 % 99.5 %
## (Intercept)  848.12 955.07
## log(nox + 1)  -4.24  35.57
# The 99% CI for the slope is (-4.24, 35.57), which crosses 0. Therefore, the slope
# is not statistically significant at the 1% level.
  5. Now fit a model predicting mortality rate using levels of nitric oxides, sulfur dioxide, and hydrocarbons as inputs. Use appropriate transformations when helpful. Plot the fitted regression model and interpret the coefficients.
# Log-transform both nox and hc, and center so2 at its mean
so2_center<-pollution$so2-mean(pollution$so2)
mod5<-lm(mort~log(nox)+so2_center+log(hc),data=pollution)
summary(mod5)
## 
## Call:
## lm(formula = mort ~ log(nox) + so2_center + log(hc), data = pollution)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.262 -35.757  -0.413  37.602 171.152 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 957.8402    21.4379  44.680   <2e-16 ***
## log(nox)     56.0779    22.7132   2.469   0.0166 *  
## so2_center    0.2638     0.1654   1.594   0.1165    
## log(hc)     -53.6761    20.1715  -2.661   0.0101 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.43 on 56 degrees of freedom
## Multiple R-squared:  0.2733, Adjusted R-squared:  0.2344 
## F-statistic:  7.02 on 3 and 56 DF,  p-value: 0.0004336
plot(mod5)

# The fitted regression is mort = 957.84 + 56.08*log(nox) + 0.26*so2_center - 53.68*log(hc)
# Interpretation: when nox = 1, hc = 1, and so2 is at its mean, predicted mort is 957.84.
# Each one-unit increase in log(nox) is associated with a 56.08 increase in mort;
# each one-unit increase in so2 above its mean with a 0.26 increase;
# and each one-unit increase in log(hc) with a 53.68 decrease.
  6. Cross-validate: fit the model you chose above to the first half of the data and then predict for the second half. (You used all the data to construct the model in 4, so this is not really cross-validation, but it gives a sense of how the steps of cross-validation can be implemented.)
data_firsthalf  <- pollution[1:30, ]
data_secondhalf <- pollution[31:60, ]
# Center so2 with the mean from the training (first) half
data_firsthalf$so2_center  <- data_firsthalf$so2 - mean(data_firsthalf$so2)
data_secondhalf$so2_center <- data_secondhalf$so2 - mean(data_firsthalf$so2)
mod6 <- lm(mort ~ log(nox) + so2_center + log(hc), data = data_firsthalf)
# Predict mortality for the held-out second half
PI <- predict(mod6, newdata = data_secondhalf, interval = "prediction")
PI
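One way to summarize how well the model predicts the held-out half is the root mean squared error of the predictions; `observed` and `predicted` below are placeholders for the held-out mortality values and the corresponding point predictions:

```r
# RMSE of held-out predictions (observed and predicted are placeholder arguments)
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
```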

Study of teenage gambling in Britain

library(faraway)  # provides the teengamb data
data(teengamb)
?teengamb
  1. Fit a linear regression model with gamble as the response and the other variables as predictors and interpret the coefficients. Make sure you rename and transform the variables to improve the interpretability of your regression model.
center_verbal<-teengamb$verbal-mean(teengamb$verbal)
mod11<-lm(log(gamble+1)~log(income+1)+sex+status+center_verbal,data=teengamb)
summary(mod11)
## 
## Call:
## lm(formula = log(gamble + 1) ~ log(income + 1) + sex + status + 
##     center_verbal, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5829 -0.5913  0.1162  0.8323  2.0818 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.76617    0.98339  -0.779  0.44028    
## log(income + 1)  1.25898    0.30774   4.091  0.00019 ***
## sex             -1.03618    0.39263  -2.639  0.01161 *  
## status           0.02622    0.01354   1.937  0.05946 .  
## center_verbal   -0.25497    0.10613  -2.402  0.02079 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.109 on 42 degrees of freedom
## Multiple R-squared:  0.4994, Adjusted R-squared:  0.4517 
## F-statistic: 10.48 on 4 and 42 DF,  p-value: 5.615e-06
plot(mod11)

  2. Create a 95% confidence interval for each of the estimated coefficients and discuss how you would interpret this uncertainty.
round(confint(mod11,level=0.95),2) 
##                 2.5 % 97.5 %
## (Intercept)     -2.75   1.22
## log(income + 1)  0.64   1.88
## sex             -1.83  -0.24
## status           0.00   0.05
## center_verbal   -0.47  -0.04
# The intervals for log(income + 1), sex, and center_verbal exclude 0, so those coefficients
# are significant at the 5% level; the intervals for the intercept and status include 0, so those are not.
  3. Predict the amount that a male with average status, income and verbal score would gamble along with an appropriate 95% CI. Repeat the prediction for a male with maximal values of status, income and verbal score. Which CI is wider and why is this result expected?
mod_gamble <- lm(log(gamble + 1) ~ log(income + 1) + sex + status + verbal, data = teengamb)
predict_mean <- predict(mod_gamble,
                        newdata = data.frame(sex = 0, status = mean(teengamb$status),
                                             income = mean(teengamb$income),
                                             verbal = mean(teengamb$verbal)),
                        level = 0.95, interval = "prediction")
predict_max <- predict(mod_gamble,
                       newdata = data.frame(sex = 0, status = max(teengamb$status),
                                            income = max(teengamb$income),
                                            verbal = max(teengamb$verbal)),
                       level = 0.95, interval = "prediction")
predict_mean - predict_max
##         fit        lwr       upr
## 1 -1.241176 -0.9959947 -1.486358
# The PI for a male with maximal predictor values is wider, as expected: those values lie far
# from the center of the data, where the model's predictions are more uncertain.
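The interval widths can be compared directly from the `predict()` output, which returns columns fit, lwr, and upr:

```r
# Width of each 95% prediction interval (upper minus lower bound)
predict_mean[, "upr"] - predict_mean[, "lwr"]
predict_max[, "upr"]  - predict_max[, "lwr"]
```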

School expenditure and test scores from USA in 1994-95

library(faraway)  # provides the sat data
data(sat)
?sat
  1. Fit a model with total sat score as the outcome and expend, ratio and salary as predictors. Make necessary transformations in order to improve the interpretability of the model. Interpret each of the coefficients.
center_salary<-sat$salary-mean(sat$salary)
center_ratio<-sat$ratio-mean(sat$ratio)
mod111<-lm(total~center_salary+center_ratio+log(expend),data=sat)
summary(mod111)
## 
## Call:
## lm(formula = total ~ center_salary + center_ratio + log(expend), 
##     data = sat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -142.750  -46.921    1.093   47.136  122.776 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    872.174    240.323   3.629 0.000711 ***
## center_salary   -7.214      4.600  -1.568 0.123703    
## center_ratio     4.692      6.770   0.693 0.491734    
## log(expend)     53.510    137.063   0.390 0.698040    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.95 on 46 degrees of freedom
## Multiple R-squared:  0.2027, Adjusted R-squared:  0.1507 
## F-statistic: 3.897 on 3 and 46 DF,  p-value: 0.01457
plot(mod111)

# Interpretation: when salary and ratio are at their means and expend = 1, predicted total is 872.17.
# Each one-unit increase in center_salary is associated with a 7.21 decrease in total score;
# each one-unit increase in center_ratio with a 4.69 increase;
# and each one-unit increase in log(expend) with a 53.51 increase.
  2. Construct 98% CI for each coefficient and discuss what you see.
round(confint(mod111,level=0.98),2) 
##                   1 %    99 %
## (Intercept)    292.95 1451.40
## center_salary  -18.30    3.87
## center_ratio   -11.62   21.01
## log(expend)   -276.84  383.86
# Except for the intercept, all of the coefficient CIs cross 0, which means none of the
# predictors is significant at the 2% level.
  3. Now add takers to the model. Compare the fitted model to the previous model and discuss which of the models seems to explain the outcome better?
center_salary<-sat$salary-mean(sat$salary)
center_ratio<-sat$ratio-mean(sat$ratio)
mod333<-lm(total~center_salary+center_ratio+log(expend)+takers,data=sat)
plot(resid(mod333))

# The model with takers seems to explain the outcome better: its residuals are smaller and
# more evenly scattered around resid = 0.
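Because mod333 simply adds takers to mod111, the two models are nested and can also be compared with an F-test:

```r
# A small p-value favors the larger model (the one including takers)
anova(mod111, mod333)
```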

Conceptual exercises.

Special-purpose transformations:

For a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district. Suppose that you know the amount of money raised by each candidate; label these dollar values \(D_i\) and \(R_i\). You would like to combine these into a single variable that can be included as an input variable into a model predicting vote share for the Democrats.

Discuss the advantages and disadvantages of the following measures:

  • The simple difference, \(D_i-R_i\). Advantages: easy to compute, and it stays in dollar units. Disadvantages: it ignores scale, since a $100,000 gap means something very different in a low-spending race than in a high-spending one.
  • The ratio, \(D_i/R_i\). Advantages: easy to compute and scale-free. Disadvantages: it is asymmetric and unbounded, ranging over \((0,\infty)\) with parity at 1, undefined when \(R_i=0\), and highly sensitive to small \(R_i\).
  • The difference on the logarithmic scale, \(\log D_i-\log R_i\). Advantages: scale-free, symmetric around 0, and it tames very large values of \(D_i\) or \(R_i\). Disadvantages: undefined when either candidate raises nothing.
  • The relative proportion, \(D_i/(D_i+R_i)\). Advantages: easy to compute and to interpret as the Democratic share of two-party fundraising, bounded between 0 and 1. Disadvantages: because it is bounded, equal-sized changes mean different things near 0.5 than near the endpoints, which complicates its use as a linear predictor.
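Note that the last two measures carry the same information on different scales: the log-odds of the relative proportion is exactly the difference on the logarithmic scale, \[\mathrm{logit}\!\left(\frac{D_i}{D_i+R_i}\right)=\log\frac{D_i}{R_i}=\log D_i-\log R_i.\]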

Transformation

For observed pair of \(\mathrm{x}\) and \(\mathrm{y}\), we fit a simple regression model \[\mathrm{y}=\alpha + \beta \mathrm{x} + \mathrm{\epsilon}\] which results in estimates \(\hat{\alpha}=1\), \(\hat{\beta}=0.9\), \(SE(\hat{\beta})=0.03\), \(\hat{\sigma}=2\) and \(r=0.3\).

  1. Suppose that the explanatory variable values in a regression are transformed according to the \(\mathrm{x}^{\star}=\mathrm{x}-10\) and that \(\mathrm{y}\) is regressed on \(\mathrm{x}^{\star}\). Without redoing the regression calculation in detail, find \(\hat{\alpha}^{\star}\), \(\hat{\beta}^{\star}\), \(\hat{\sigma}^{\star}\), and \(r^{\star}\). What happens to these quantities when \(\mathrm{x}^{\star}=10\mathrm{x}\) ? When \(\mathrm{x}^{\star}=10(\mathrm{x}-1)\)?
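A sketch of the reasoning for the shift case: substituting \(\mathrm{x}=\mathrm{x}^{\star}+10\) into the fitted model gives \[\mathrm{y}=\hat{\alpha}+\hat{\beta}(\mathrm{x}^{\star}+10)=(\hat{\alpha}+10\hat{\beta})+\hat{\beta}\mathrm{x}^{\star},\] so \(\hat{\alpha}^{\star}=1+10(0.9)=10\) and \(\hat{\beta}^{\star}=0.9\), while \(\hat{\sigma}^{\star}=2\) and \(r^{\star}=0.3\) are unchanged, because the residuals and the correlation are invariant to shifting \(\mathrm{x}\).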

  2. Now suppose that the response variable scores are transformed according to the formula \(\mathrm{y}^{\star\star}= \mathrm{y}+10\) and that \(\mathrm{y}^{\star\star}\) is regressed on \(\mathrm{x}\). Without redoing the regression calculation in detail, find \(\hat{\alpha}^{\star\star}\), \(\hat{\beta}^{\star\star}\), \(\hat{\sigma}^{\star\star}\), and \(r^{\star\star}\). What happens to these quantities when \(\mathrm{y}^{\star\star}=5\mathrm{y}\) ? When \(\mathrm{y}^{\star\star}=5(\mathrm{y}+2)\)?
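Similarly, adding 10 to \(\mathrm{y}\) shifts only the intercept: \[\mathrm{y}^{\star\star}=(\hat{\alpha}+10)+\hat{\beta}\mathrm{x}+\mathrm{\epsilon},\] so \(\hat{\alpha}^{\star\star}=11\), \(\hat{\beta}^{\star\star}=0.9\), \(\hat{\sigma}^{\star\star}=2\), and \(r^{\star\star}=0.3\).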

  3. In general, how are the results of a simple regression analysis affected by linear transformations of \(\mathrm{y}\) and \(\mathrm{x}\)?

  4. Suppose that the explanatory variable values in a regression are transformed according to the \(\mathrm{x}^{\star}=10(\mathrm{x}-1)\) and that \(\mathrm{y}\) is regressed on \(\mathrm{x}^{\star}\). Without redoing the regression calculation in detail, find \(SE(\hat{\beta}^{\star})\) and \(t^{\star}_0= \hat{\beta}^{\star}/SE(\hat{\beta}^{\star})\).
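As a sketch: since \(\mathrm{x}^{\star}=10(\mathrm{x}-1)\) rescales \(\mathrm{x}\) by a factor of 10, the slope and its standard error both shrink by that factor, \(\hat{\beta}^{\star}=0.9/10=0.09\) and \(SE(\hat{\beta}^{\star})=0.03/10=0.003\), so the t-statistic is unchanged: \(t^{\star}_0=0.09/0.003=30\), the same as \(0.9/0.03\).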

  5. Now suppose that the response variable scores are transformed according to the formula \(\mathrm{y}^{\star\star}=5(\mathrm{y}+2)\) and that \(\mathrm{y}^{\star\star}\) is regressed on \(\mathrm{x}\). Without redoing the regression calculation in detail, find \(SE(\hat{\beta}^{\star\star})\) and \(t^{\star\star}_0= \hat{\beta}^{\star\star}/SE(\hat{\beta}^{\star\star})\).
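As a sketch: rescaling \(\mathrm{y}\) by 5 multiplies the slope and its standard error by the same factor, \(\hat{\beta}^{\star\star}=5(0.9)=4.5\) and \(SE(\hat{\beta}^{\star\star})=5(0.03)=0.15\), so \(t^{\star\star}_0=4.5/0.15=30\), again unchanged.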

  6. In general, how are the hypothesis tests and confidence intervals for \(\beta\) affected by linear transformations of \(\mathrm{y}\) and \(\mathrm{x}\)?

Hw2_byhands part


Feedback comments etc.

If you have any comments about the homework or the class, please write your feedback here. We would love to hear your opinions.